Algorithm for Model Validation: Theory and Applications 



D. Sornette 1,2,3, A. B. Davis 4, K. Ide 1,5, K. R. Vixie 6, V. Pisarenko 7, and J. R. Kamm 8

1 Institute of Geophysics and Planetary Physics, University of California, Los Angeles, CA 90095, USA 

2 Department of Earth and Space Sciences, University of California, Los Angeles, CA 90095, USA 
and Laboratoire de Physique de la Matiere Condensee, CNRS UMR 6622, 
Universite de Nice-Sophia Antipolis, 06108 Nice Cedex 2, France 

3 now at D-MTEC, ETH Zurich, CH-8032 Zurich, Switzerland 

4 Los Alamos National Laboratory, Space and Remote Sensing Group (ISR-2), 
Los Alamos, NM 87545, USA 



5 Department of Atmospheric and Oceanic Sciences, University of California, Los Angeles, CA 90095, USA

6 Los Alamos National Laboratory, Mathematical Modeling and Analysis Group (T-7), 

Los Alamos, New Mexico 87545, USA 

7 International Institute of Earthquake Prediction Theory and Mathematical Geophysics, 
Russian Academy of Sciences, Warshavskoye sh., 79, kor. 2, Moscow 113556, Russia 

8 Los Alamos National Laboratory, Applied Science and Methods Development Group (X-1),

Los Alamos, New Mexico 87545, USA 



Abstract: Validation is often defined as the process of determining the degree to which a model is an 
accurate representation of the real world from the perspective of its intended uses. Validation is crucial as 
industries and governments depend increasingly on predictions by computer models to justify their deci- 
sions. We propose to formulate the validation of a given model as an iterative construction process that 
mimics the often implicit process occurring in the minds of scientists. We offer a formal representation of 
the progressive build-up of trust in the model. We thus replace static claims on the impossibility of vali- 
dating a given model by a dynamic process of constructive approximation. This approach is better adapted 
to the fuzzy, coarse-grained nature of validation. Our procedure factors in the degree of redundancy ver- 
sus novelty of the experiments used for validation as well as the degree to which the model predicts the 
observations. We illustrate the new methodology first with the maturation of Quantum Mechanics as the 
arguably best established physics theory and then with several concrete examples drawn from some of 
our primary scientific interests: a cellular automaton model for earthquakes, a multifractal random walk 
model for financial time series, an anomalous diffusion model for solar radiation transport in the cloudy 
atmosphere, and a computational fluid dynamics code for the Richtmyer-Meshkov instability.

Introduction: Model Construction and Validation

At the heart of the scientific endeavor, model building involves a slow and arduous selection process, 
which can be roughly represented as proceeding according to the following steps: (1) start from observa- 
tions and/or experiments; (2) classify them according to regularities that they may exhibit: the presence 
of patterns, of some order, also sometimes referred to as structures or symmetries, is begging for "expla- 
nations" and is thus the nucleation point of modeling; (3) use inductive reasoning, intuition, analogies, 
and so on, to build hypotheses from which a model [1] is constructed; (4) test the model obtained in step 
3 with available observations, and then extract predictions that are tested against new observations or by 
developing dedicated experiments. The model is then rejected or refined by an iterative process, a loop 
going from (1) to (4). A given model is progressively validated by the accumulated confirmations of its 
predictions by repeated experimental and/or observational tests. 

Using a model requires a language, i.e., a vocabulary and syntax, to express it. The language can 
be English or French to obtain predicates specifying the properties of and/or relation with the subject(s). 
It can be mathematics which is arguably the best language to formalize the relation between quantities, 
structures, space and change. It can be a computer language to implement a set of relations and instructions 
logically linked in a computer code to obtain quantitative outputs in the form of strings of numbers. In this
latter version, validation must be distinguished from verification: whereas verification deals with whether
the simulation code correctly solves the model equations, validation carries an additional degree of trust 
in the value of the model vis-a-vis experiment and, therefore, may convince one to use its predictions to 
explore beyond known territories [2]. 

The validation of models is becoming a major issue as humans are increasingly faced with decisions 
involving complex tradeoffs in problems with large uncertainties, as for instance in attempts to control 
the growing anthropogenic burden on the planet [3] within a risk-cost framework [4] based on predictions 
of models. For policy decisions, federal, state, and local governments increasingly depend on computer 
models that are scrutinized by scientific agencies to attest to their legitimacy and reliability. Cognizance 
of this trend and its scientific implications is not lost on the engineering [5] and physics [6] communities. 

How does one validate a model when it makes predictions on objects that are not fully replicated in the 
laboratory, either in the range of variables, of parameters or of scales? Indeed, a potentially far-reaching 
consequence of validation is to give the "green light" for extrapolating a body of knowledge, which is 
firmly established only in some limited ranges of variables, parameters and scales. Predictive capability is 
what enables us to go beyond this clearly defined domain into a more fuzzy area of unknown conditions 
and outcomes. This problem has repeatedly appeared in different guises in practically all scientific fields. 
A notable domain of application is risk assessment: see for instance the classic paper on risks [7], and the 
instructive history of quantitative risk analysis in US regulatory practice [8], especially in the U.S. nuclear 
power industry [9, 10, 11, 12]. 

An acute question in risk assessment is how to quantify the potential for a catastrophic event
(earthquake, tornado, hurricane, flood, huge solar mass ejection, large bolide, industrial plant
explosion, ecological disaster, financial crash, economic collapse, etc.) of an amplitude never yet sampled,
using only the knowledge of past history and present understanding. This is crucial, for example, in the problem
of scaling the physics of material and rock rupture tested in the laboratory to the scale of earthquakes. 
This is necessary for scaling the knowledge of hydrodynamical processes quantified in the laboratory to 
the length and time scales relevant to the atmospheric/oceanic weather and climate, not to mention astro- 
physical systems. Perhaps surprisingly, the same problem arises in the evaluation of electronic circuits 
[13]: "The problem is that there is no systematic way to determine the range of applicability of the models 
provided within circuit simulator component libraries." The example of validation of electronic circuits is 
particularly interesting because it identifies the origin of the difficulties inherent in validation: the
dynamics are strongly nonlinear and complex, with threshold effects, and therefore do not allow for a
simple-minded analytic approach consisting in testing a circuit component by component. This same difficulty is
found in validating general circulation models of the Earth's climate or end-to-end computer simulations
of complex engineering systems such as an aircraft or a nuclear weapon. The problem is fundamentally
due to its systemic nature.

The theory of systems, sometimes referred to as the theory of complex systems, is characterized by the 
occurrence of surprises. The biggest one may be the phenomenon of "emergence" in which qualitatively 
new processes or structures appear in the collective behavior of the system, while they can rarely be 
derived or guessed from the behavior of each element. The phenomenon of "emergence" is similar to the 
philosophical law on the "transfer of the quantity into the quality." Full control of the validation process
requires accounting for this emergence phenomenon, because it may contribute to the epistemic uncertainty
(the uncertainty attributable to incomplete knowledge about a phenomenon that affects our ability to model 
it) associated with so-called "unknown unknowns." 

Impossibility Statements 

For these reasons, the possibility of validating numerical models of natural phenomena, often either endorsed
implicitly or identified as a reachable goal by natural scientists in their daily work, has been challenged;
quoting Oreskes et al. [14]: "Verification and validation of numerical models of natural systems is impos- 
sible. This is because natural systems are never closed and because model results are always non-unique." 
According to this view, the impossibility of "verifying" or "validating" models is not limited to computer
models and codes but extends to all theories that rely necessarily on imperfectly measured data and auxiliary
hypotheses, as Sterman et al. [15] put it: "Any theory is underdetermined and thus unverifiable, whether it
is embodied in a large-scale computer model or consists of the simplest equations." Accordingly, many 
uncertainties undermine the predictive reliability of any model of a complex natural system in advance of 
its actual use. 

Such "impossibility" statements are reminiscent of other "impossibility theorems." Consider the math- 
ematics of algorithmic complexity [16], which provides one approach to the study of complex systems. 
Following reasoning related to that underpinning Gödel's incompleteness theorem, most complex systems
have been proved to be computationally irreducible, i.e., the only way to predict their evolution is to ac- 
tually let them evolve in time. Accordingly, the future time evolution of most complex systems appears 
inherently unpredictable. Such sweeping statements turn out to have basically no practical value. This 
is because, in physics and other related sciences, one aims at predicting coarse-grained properties. Only 
by ignoring most of molecular detail, for example, did researchers ever develop the laws of thermody- 
namics, fluid dynamics and chemistry. Physics works and is not hampered by computational irreducibility 
because we only ask for approximate answers at some coarse-grained level [17]. By developing exact 
coarse-grained procedures on computationally irreducible cellular automata, Israeli and Goldenfeld [18] 
have demonstrated that prediction may simply depend on finding the right level for describing the system. 
More generally, we argue that only coarse-grained scales are of interest in practice but their description 
requires effective laws which are in general based on finer scales. In other words, real understanding must 
be rooted in the ability to predict coarser scales from finer scales, i.e., a real understanding solves the 
universal micro-macro challenge. Similarly, we propose that validation is possible, to some degree, as 
explained below. 

Validation and Hypothesis Testing 

We start by recognizing that validation is closely related to hypothesis testing and statistical significance 
tests of mathematical statistics [19], a point made previously by several other authors [20, 21, 22, 23, 24].
In hypothesis testing, a null hypothesis $H_0$ is compared with an alternative hypothesis $H_1$ in their ability to explain
and fit data. The result of the test is either to "reject $H_0$ in favor of $H_1$" or "not reject $H_0$." One never
concludes "reject $H_1$," or even "accept $H_0$ or $H_1$." If one concludes "do not reject $H_0$," this does not necessarily
mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against $H_0$
in favor of $H_1$. Rejecting the null hypothesis may suggest but does not prove that the alternative hypothesis
is true, only that it is better given the data. Thus, one can never prove that an hypothesis is true, only that
it is less effective in explaining the data than another hypothesis. One can also conclude that an hypothesis
$H_1$ is not necessary and the other more parsimonious hypothesis $H_0$ should be favored. The alternative
hypothesis $H_1$ is not rejected, strictly speaking, but can be found unnecessary or redundant with respect to
$H_0$. This is the situation when there are two (or several) alternative hypotheses $H_0$ and $H_1$, which can be
composite, nested, or non-nested (the technical difficulties of hypothesis testing depend on these structures
of the competing hypotheses [25]). This illuminates the status of code comparison in verification and
validation [26]. Viewed in this way, it is clear why code comparison alone, i.e., independent of compari- 
son to observations/experiments, is not sufficient for validation since validation requires comparison with 
experiments and several other steps described below. The analogy with hypothesis testing makes clear 
that code comparison allows the selection of one code among several codes but does not help to conclude 
about the validity of a given code or model when considered as a unique entity independently of other 
codes or models. We should stress that the Sandia report [26] presents an even more negative view of 
code comparisons because it addresses the common practice in the computer community that turns to code 
comparisons rather than real verification or validation, without any independent referents. Here, using the 
analogy with hypothesis testing, we have taken a more positive view of "Code 1 versus referent compared
with Code 2 versus referent," leading to an inference about which code is better based on the comparative
performance with the data. While some will consider this as real validation, this procedure does not
address the challenges raised earlier, which justifies the algorithm delineated in following sections. 

In the theory of hypothesis testing, there is a second class of tests, called "tests of significance," in 
which one considers a unique hypothesis $H_0$ (model), and the alternative is "all the rest," i.e., all hypotheses
that differ from $H_0$. In that case, the conclusion of a test can be the following: "this data sample does
not contradict the hypothesis $H_0$," which is of course not the same as "the hypothesis $H_0$ is true." In
other words, a test of significance cannot "accept" an hypothesis, it can only fail to reject it because the
hypothesis is found sufficient at some confidence level for explaining the available data. Multiplying the
tests will not help in accepting $H_0$.
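
As a minimal sketch of such a test of significance, the following Python snippet applies a two-sided sign test to the residuals between observations and model predictions. The null hypothesis is that the model is unbiased, so that positive and negative residuals are equally likely. The function names `sign_test_p_value` and `verdict`, the sample data, and the 5% threshold are illustrative assumptions, not part of the original text.

```python
from math import comb

def sign_test_p_value(observed, predicted):
    """Two-sided sign test p-value for H0: residuals have zero median."""
    # Exact ties carry no sign information and are dropped.
    diffs = [o - m for o, m in zip(observed, predicted) if o != m]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(1 for d in diffs if d > 0)
    # Under H0 the number of positive signs follows Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(0, min(k, n - k) + 1)) / 2**n
    return min(1.0, 2.0 * tail)

def verdict(p_value, alpha=0.05):
    # A test of significance never "accepts" H0; it can only fail to reject it.
    return "reject H0" if p_value < alpha else "data do not contradict H0"

# Hypothetical data: the model predicts 1.0 for every measurement.
balanced = [1.1, 0.9, 1.2, 0.8, 1.3, 0.7]            # residuals of both signs
biased = [1.1, 1.2, 1.3, 1.1, 1.2, 1.4, 1.1, 1.3]    # all residuals positive

p_balanced = sign_test_p_value(balanced, [1.0] * 6)
p_biased = sign_test_p_value(biased, [1.0] * 8)
```

With these data, `verdict(p_balanced)` reports that the data do not contradict the hypothesis, while `verdict(p_biased)` rejects it. Neither outcome amounts to a statement that the model is true.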

Since validation must at least contain hypothesis testing, this shows that statements like "verification 
and validation of numerical models of natural systems is impossible" [14] are best rephrased in the lan- 
guage of mathematical statistics [19]: the theory of statistical hypothesis testing has taught mathematical 
and applied statisticians for decades that one can never prove an hypothesis or a model to be true. One 
can only develop an increasing trust in it by subjecting it to more and more tests which "do not reject it." 
We attempt to formalize below how such trust can be built up to lead to validation viewed as an evolving 
process. 

Validation as a Constructive Iterative Process 

In a standard exercise of model validation, one performs an experiment and, in parallel, runs the cal- 
culations with the available model. Then, a comparison between the measurements of the experiment 
and the outputs of the model calculations is performed. This comparison uses some metrics controlled 
by experimental feasibility, i.e., what can actually be measured. One then iterates by refining the model 
until (admittedly subjective) satisfactory agreement is obtained. Then, another set of measurements is 
performed, which is compared with the corresponding predictions of the model. If the agreement is still 
satisfactory without modifying the model, this is considered progress in the validation of the model. It- 
erating with experiments testing different features of the model corresponds to mimicking the process of 
construction of a theory in physics [27]. As the model is exposed to increasing scrutiny and testing, the 
testers develop a better understanding of the reliability (and limitations) of the model in predicting the 
outcome of new experimental and/or observational set-ups. This implies that "validation activity should 
be organized like a project, with goals and requirements, a plan, resources, a schedule, and a documented 
record" [6]. 

Extending previous proposals [20, 21, 22, 23, 24], we thus propose to formulate the validation problem
of a given model as an iterative construction that embodies the often implicit process occurring in the minds
of scientists: 

1. One starts with an a priori trust quantified by the value $V_{\mathrm{prior}}$ in the potential value of the model. This
quantity captures the accumulated evidence thus far. If the model is new or the validation process is
just starting, take $V_{\mathrm{prior}} = 1$. As we will soon see, the absolute value of $V_{\mathrm{prior}}$ is unimportant but its
relative change is important.

2. An experiment is performed, the model is set up to calculate what should be the outcome of the
experiment, and the comparison between these predictions and the actual measurements is made 
either in model space or in observation space. The comparison requires a choice of metrics. 

3. Ideally, the quality of the comparison between predictions and observations is formulated as a sta- 
tistical test of significance in which an hypothesis (the model) is tested against the alternative, which 
is "all the rest." Then, the formulation of the comparison will be either "the model is rejected" (it 
is not compatible with the data) or "the model is compatible with the data." In order to implement 
this statistical test, one needs to attribute a likelihood $p(M|y_{\mathrm{obs}})$ or, more generally, a metric-based
"grade" that quantifies the quality of the comparison between the predictions of the model $M$ and
observations $y_{\mathrm{obs}}$. This grade is compared with the reference likelihood $q$ of "all the rest." Examples
of implementations include the sign test and the tolerance interval methods [28]. In many cases, one
does not have the luxury of a likelihood; one has then to resort to more empirical ratings of how
well the model explains crucial observations. In the most complex cases, these ratings can be
binary (accepted or rejected).

4. The posterior value of the model is obtained according to a formula of the type 

$$ V_{\mathrm{posterior}} / V_{\mathrm{prior}} = F\left[ p(M|y_{\mathrm{obs}}),\, q;\, c_{\mathrm{novel}} \right]. \qquad (1) $$

In this expression, $V_{\mathrm{posterior}}$ is the posterior potential, or coefficient, of trust in the value of the
model after the comparison between the prediction of the model and the new observations has
been performed. By the action of $F(\cdots)$, $V_{\mathrm{posterior}}$ can be either larger or smaller than $V_{\mathrm{prior}}$: in
the former case, the experimental test has increased our trust in the validity of the model; in the
latter case, the experimental test has signaled problems with the model. One could call $V_{\mathrm{prior}}$ and
$V_{\mathrm{posterior}}$ the evolving "potential value of our trust" in the model or, loosely paraphrasing the theory
of decision making in economics, the "utility" of the model [31].

The transformation from the potential value $V_{\mathrm{prior}}$ of the model before the experimental test to $V_{\mathrm{posterior}}$
after the test is embodied into the multiplier $F$, which can be either larger than 1 (towards validation) or
smaller than 1 (towards invalidation). We postulate that $F$ depends on the grade $p(M|y_{\mathrm{obs}})$, to be interpreted
as proportional to the probability of the model $M$ given the data $y_{\mathrm{obs}}$. It is natural to compare this
probability with the reference likelihood $q$ that one or more of all other conceivable models is compatible
with the same data.

The multiplier $F$ depends also on a parameter $c_{\mathrm{novel}}$ that quantifies the importance of the test. In other
words, $c_{\mathrm{novel}}$ is a measure of the impact of the experiment or of the observation, that is, how well the
new observation explores novel "dimensions" of the parameter and variable spaces of both the process
and the model that can reveal potential flaws. A fundamental challenge is that the determination of $c_{\mathrm{novel}}$
requires, in some sense, a pre-existing understanding of the physical processes so that the value of a
new experiment can be fully appreciated. In concrete situations, one has only a limited understanding of
the physical processes and the value of a new observation is only assessed after a long learning phase,
after comparison with other observations and experiments, as well as after comparison with the model,
making $c_{\mathrm{novel}}$ possibly self-referencing. Thus, we consider $c_{\mathrm{novel}}$ to be basically a judgment-based weighting
of experimental referents, in which judgment (for example, by a subject matter expert) is dominant in its
determination. The fundamental problem is to quantify the relevance of a new experimental referent for
validation to a given decision-making problem, given that the experimental domain of the test does not
overlap with the application domain of the decision. Assignment of $c_{\mathrm{novel}}$ requires the judgment of subject
matter experts, whose opinions will likely vary. This variability must be acknowledged (if not accounted for,
however naively) in assigning $c_{\mathrm{novel}}$. Thus, providing an a priori value for $c_{\mathrm{novel}}$, as required in expression
(1), remains a difficult and key step in the validation process. This difficulty is similar to specifying the
utility function in decision making [31].

Repeating an experiment twice is a special degenerate case since it amounts ideally to increasing the 
statistical size of the sample. In such a situation, one should aggregate the two experiments 1 and 2 
(yielding the relative likelihoods $p_1/q$ and $p_2/q$ respectively) graded with the same $c_{\mathrm{novel}}$ into an effective
single test with the same $c_{\mathrm{novel}}$ and likelihood $(p_1/q)(p_2/q)$. This is the ideal situation; there are, however, cases
where repeating an experiment may wildly increase the presence of epistemic uncertainty (or demonstrate
uncontrolled variability or other kinds of problems). When this occurs, this means that the assumption that
there is no surprise, no novelty, in repeating the experiment is incorrect. Then, the two experiments should
be treated so as to contribute two multipliers $F$, because they reveal different kinds of uncertainty that
can be generated by ensembles of experiments. 

One experimental test corresponds to an entire loop 1-4 transforming a given $V_{\mathrm{prior}}$ to a $V_{\mathrm{posterior}}$
according to (1). This $V_{\mathrm{posterior}}$ becomes the new $V_{\mathrm{prior}}$ for the next test, which will transform it into
another $V_{\mathrm{posterior}}$ and so on, according to the following iteration process:

$$ V^{(1)}_{\mathrm{prior}} \to V^{(1)}_{\mathrm{posterior}} = V^{(2)}_{\mathrm{prior}} \to V^{(2)}_{\mathrm{posterior}} = V^{(3)}_{\mathrm{prior}} \to \cdots \to V^{(n)}_{\mathrm{posterior}}. \qquad (2) $$

After n validation loops, we end up with a posterior trust in the model given by 

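$$ V^{(n)}_{\mathrm{posterior}} / V^{(1)}_{\mathrm{prior}} = F\left[ p^{(1)}(M|y^{(1)}_{\mathrm{obs}}),\, q^{(1)};\, c^{(1)}_{\mathrm{novel}} \right] \cdots F\left[ p^{(n)}(M|y^{(n)}_{\mathrm{obs}}),\, q^{(n)};\, c^{(n)}_{\mathrm{novel}} \right], \qquad (3) $$

where the product is time-ordered since the sequence of values for $c^{(i)}_{\mathrm{novel}}$ depends on preceding tests. Validation
can be said to be asymptotically satisfied when the number of steps $n$ and the final value $V^{(n)}_{\mathrm{posterior}}$
are sufficiently high. How high is high enough is subjective and may depend on both the application and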
are sufficiently high. How high is high enough is subjective and may depend on both the application and 
programmatic constraints. The concrete examples discussed below offer some insight on this issue. This 
construction makes clear that there is no absolute validation, only a process of corroborating or disproving 
steps competing in a global valuation of the model under scrutiny. The product (3) expresses the as- 
sumption that successive observations give independent multipliers. This assumption keeps the procedure 
simple because determining the dependence between different tests with respect to validation would be 
highly undetermined. We propose that it is more convenient to measure the dependence through the single 
parameter $c^{(j)}_{\mathrm{novel}}$ quantifying the novelty of the $j$th test with respect to those preceding it. In full generality,
each new F multiplier should be a function of all previous tests. 
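
The iteration (2)-(3) can be sketched in a few lines of Python. The multiplier is passed in as a function so that any form of $F$ can be plugged in. The `power_multiplier` used below, $F = (p/q)^{c_{\mathrm{novel}}}$, is an assumption made for illustration, as are the names `ValidationTest` and `update_trust` and all numerical grades.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class ValidationTest:
    p: float        # grade of the model on this test
    q: float        # reference likelihood of "all the rest"
    c_novel: float  # judged novelty/impact of the test

def power_multiplier(p: float, q: float, c_novel: float) -> float:
    # Illustrative assumption: F = (p/q)**c_novel.
    return (p / q) ** c_novel

def update_trust(tests: Iterable[ValidationTest],
                 multiplier: Callable[[float, float, float], float],
                 v_prior: float = 1.0) -> List[float]:
    """Return [V_prior, V_posterior after test 1, ..., after test n]."""
    trajectory = [v_prior]
    for t in tests:  # time-ordered: each posterior becomes the next prior
        trajectory.append(trajectory[-1] * multiplier(t.p, t.q, t.c_novel))
    return trajectory

# Hypothetical sequence of three tests.
tests = [ValidationTest(p=0.8, q=0.5, c_novel=1.0),   # passed, ordinary test
         ValidationTest(p=0.6, q=0.5, c_novel=2.0),   # passed, more novel test
         ValidationTest(p=0.25, q=0.5, c_novel=1.0)]  # failed test
trajectory = update_trust(tests, power_multiplier)
```

The trajectory rises after the two passed tests and drops after the failed one. Because only ratios of successive values matter, starting from a different `v_prior` rescales the whole trajectory without changing the conclusion.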

The loop 1-4 together with expression (1) is offered as an attempt to quantify the progression of
the validation process, so that eventually, when several approximately independent tests exploring different
features of the model and of the process have been performed, $V_{\mathrm{posterior}}$ has grown to a level at which most
experts will be satisfied and will believe in the validity of the model. This formulation has the advantage 
of viewing the validation process as a convergence or divergence built on a succession of steps, mimicking 
the construction of a theory of reality [32]. Expression (3) embodies the progressive build-up of trust in 
a model or theory. This formulation provides a formal setting for discussing the difficulties that underlie
the so-called impossibilities [14, 15] in validating a given model. Here, these difficulties are not only 
partitioned but quantified: 

• in the definition of "new" non-redundant experiments (parameter $c_{\mathrm{novel}}$),

• in choosing the metrics and the corresponding statistical tests quantifying the comparison between
the model and the measurements of this experiment (leading to the likelihood ratio $p/q$), and

• in iterating the procedure so that the product of the gain/loss factors $V_{\mathrm{posterior}}/V_{\mathrm{prior}}$ obtained after
each test eventually leads to a clear-cut conclusion after several tests.

This formulation makes clear why and how one is never fully convinced that validation has been obtained: 
it is a matter of degree, of confidence level, of decision making, as in statistical testing. But, this formula- 
tion helps in quantifying what new confidence (or distrust) is gained in a given model. It emphasizes that 
validation is an ongoing process, similar to the never-ending construction of a theory of reality. 

The general formulation proposed here in terms of iterated validation loops is intimately linked with 
decision theory based on limited knowledge: the decision to "go ahead" and use the model is fundamentally
a decision problem based on the accumulated confidence embodied in $V_{\mathrm{posterior}}$. The "go/no-go"
decision must take into account conflicting requirements and compromise between different objectives. 
Decision theory, created by the statistician Abraham Wald in the late forties, is based ultimately on game 
theory [31, 33]. Wald [34] used the term loss function, which is the standard terminology used in mathe- 
matical statistics. In mathematical economics, the opposite of the loss (or cost) function gives the concept 
of the utility function, which quantifies (in a specific functional form) what is considered important and 
robust in the fit of the model to the data. We use $V_{\mathrm{posterior}}$ in an even more general sense than "utility,"
as a decision and information-based valuation that supports risk-informed decision-making based on
"satisficing" [35] (see the concrete examples discussed below). 

While expression (1) is reminiscent of a Bayesian analysis, it does not deal with probabilities. In the 
Bayesian methodology of validation [29, 30], only comparison between models can be performed due to 
the need to remove the unknown probability of the data in Bayes's formula. In contrast, our approach 
provides a value for each single model independently of the others. In addition, it emphasizes the impor- 
tance of quantifying the novelty of each test and takes a more general view on how to use the information 
provided from the goodness-of-fit. The valuation (1) of a model uses probabilities as partial inputs, not as 
the qualifying criteria for model validation. This does not mean however that there are not uncertainties in 
these quantities or in the terms $F$, $q$ or $c_{\mathrm{novel}}$ and that aleatory and systemic uncertainties [36] are ignored,
as discussed below. 



Properties of the Multiplier of the Validation Step 

The multiplier $F[p(M|y_{\mathrm{obs}}), q; c_{\mathrm{novel}}]$ should have the following properties:

1. If the statistical test(s) performed on the given observations is (are) passed at the reference level q, 
then the posterior potential value is larger than the prior potential value: $F > 1$ (resp. $F < 1$) for
$p > q$ (resp. $p < q$), which can be written succinctly as $\ln F / \ln(p/q) > 0$.

2. The larger the statistical significance of the passed test, the larger the posterior value. Hence

$$ \frac{\partial F}{\partial p} > 0 \qquad (4) $$

for a given $q$. There could be a saturation of the growth of $F$ for large $p/q$, which can be either
that $F < \infty$ as $p/q \to \infty$ or of the form of a concavity requirement $\partial^2 F/\partial p^2 < 0$ for large $p/q$:
obtaining a quality of fit beyond a certain level should not be attempted.

3. The larger the statistical level at which the test(s) performed on the given observations is (are) passed, 
the larger the impact of a "novel" experiment on the multiplier enhancing the prior into the posterior 
potential value of the model: $\partial F/\partial c_{\mathrm{novel}} > 0$ (resp. $< 0$), for $p > q$ (resp. $p < q$).

The simplest form obeying these properties (not including the saturation of the growth of $F$) is

$$ F\left[ p(M|y_{\mathrm{obs}}),\, q;\, c_{\mathrm{novel}} \right] = \left( \frac{p}{q} \right)^{c_{\mathrm{novel}}}. \qquad (5) $$



This form provides an intuitive interpretation of the meaning of the experiment impact parameter $c_{\mathrm{novel}}$.
A bland evaluation of the novelty of a test would be $c_{\mathrm{novel}} = 1$, thus $F = p/q$ and the chain (3) reduces
to a product of normalized likelihoods, as in standard statistical tests. A value $c_{\mathrm{novel}} > 1$ (resp. $< 1$) for a
given experiment describes a nonlinear rapid (resp. slow) updating of our trust $V$ as a function of the grade
$p/q$ of the model with respect to the observations. In particular, a large value of $c_{\mathrm{novel}}$ corresponds to the
case of "critical" tests. A famous example is the Michelson-Morley experiment for the Theory of Special 
Relativity. For the Theory of General Relativity, it was the observation during the 1919 solar eclipse of the 
bending of light rays from distant stars by the Sun's mass and the anomalous precession of the perihelion 
of Mercury's orbit. 
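
Reading expression (5) as the power law $F = (p/q)^{c_{\mathrm{novel}}}$, which is consistent with the statement above that $c_{\mathrm{novel}} = 1$ gives $F = p/q$, the three properties required of the multiplier can be spot-checked numerically. The sketch below is only a check on a small grid of hypothetical values, not a proof; the function names, the grid, and the finite-difference step are assumptions.

```python
import math

def multiplier_power(p, q, c_novel):
    # Assumed reading of expression (5): F = (p/q)**c_novel.
    return (p / q) ** c_novel

def check_properties(F, q=0.5, eps=1e-6):
    """Spot-check properties 1-3 of a multiplier F on a small grid."""
    for p in (0.1, 0.3, 0.7, 0.9):      # grid avoids p == q, where ln(p/q) = 0
        for c in (0.5, 1.0, 3.0):
            f = F(p, q, c)
            # Property 1: F > 1 exactly when p > q, i.e. ln F / ln(p/q) > 0.
            if math.log(f) / math.log(p / q) <= 0:
                return False
            # Property 2: dF/dp > 0 (forward finite difference).
            if F(p + eps, q, c) <= f:
                return False
            # Property 3: dF/dc_novel has the same sign as (p - q).
            dF_dc = F(p, q, c + eps) - f
            if dF_dc * (p - q) <= 0:
                return False
    return True

ok = check_properties(multiplier_power)
```

The same checker can be applied to any other candidate multiplier before it is used in the chain (3); a form that rewards poor fits, such as `lambda p, q, c: q / p`, fails the first property.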

Note that the parameterization (5) should account for the decreased novelty noted above occurring 
when the same experiment is repeated two or more times. The value of $c_{\mathrm{novel}}$ should be reduced for each
repetition of the same test; moreover, the value of $c_{\mathrm{novel}}$ should approach unity as the number of repetitions
increases. 

The alternative multiplier,

$$F = \left[ \frac{\tanh\left(p/q + 1/c_{\rm novel}\right)}{\tanh\left(1 + 1/c_{\rm novel}\right)} \right]^4 , \qquad (6)$$

is plotted in Fig. 1 as a function of $p/q$ and $c_{\rm novel}$. It emphasizes that $F$ saturates as a function of $p/q$ and
$c_{\rm novel}$ as either one or both of them grow large. A completely new experiment corresponds to $c_{\rm novel} \to \infty$,
so that $1/c_{\rm novel} = 0$ and thus $F$ tends to $[\tanh(p/q) / \tanh(1)]^4$, i.e., $V_{\rm posterior}/V_{\rm prior}$ is only determined
by the quality of the "fit" of the data by the model, quantified by $p/q$. A finite $c_{\rm novel}$ implies that one
already takes a restrained view on the usefulness of the experiment since one limits the amplitude of the
gain $F = V_{\rm posterior}/V_{\rm prior}$, whatever the quality of the fit of the data by the model. The exponent 4 in (6)
has been chosen so that the maximum confidence gain $F$ is equal to $1/(\tanh(1))^4 \approx 3$ in the best possible
situation of a completely new experiment ($c_{\rm novel} = \infty$) and perfect fit ($p/q \to \infty$). In contrast, the
multiplier $F$ can be arbitrarily small as $p/q \to 0$ even if the novelty of the test is high ($c_{\rm novel} \to \infty$). For
a finite novelty $c_{\rm novel}$, a test that fails the model miserably ($p/q \approx 0$) does not necessarily reject the model
completely: unlike with the expression in (5), $F$ remains greater than zero. Indeed, if the novelty $c_{\rm novel}$
is small, the worst-case multiplier (attained for $p/q = 0$) is
$[\tanh(1/c_{\rm novel}) / \tanh(1 + 1/c_{\rm novel})]^4 \approx 1 - 6.9\, e^{-2/c_{\rm novel}}$,
which is only slightly less than unity if $c_{\rm novel} \ll 1$. In short, this formulation does not
heavily weight unimportant tests.
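
The limits quoted in this paragraph can be checked numerically. The following Python sketch assumes that (6) has
the form $F = [\tanh(p/q + 1/c_{\rm novel}) / \tanh(1 + 1/c_{\rm novel})]^4$, which is the form implied by the two limiting
expressions given in the text; the function name `multiplier` and the trial value $c_{\rm novel} = 0.2$ are ours and
purely illustrative.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Multiplier F in the tanh form assumed for (6); inferred from the quoted limits."""
    inv_c = 1.0 / c_novel  # 1/inf evaluates to 0.0, the "completely new" limit
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


# Best case: completely new experiment (c_novel -> inf) and perfect fit (p/q -> inf).
f_max = multiplier(math.inf, math.inf)

# Worst case (p/q = 0) for a test of low novelty, next to the quoted approximation.
c_small = 0.2  # illustrative value with c_novel << 1
f_worst = multiplier(0.0, c_small)
f_worst_approx = 1.0 - 6.9 * math.exp(-2.0 / c_small)

print(f"maximum gain          : {f_max:.3f}")
print(f"worst case, c = {c_small}   : {f_worst:.6f}")
print(f"approximation         : {f_worst_approx:.6f}")
```

In this form, a test graded $p/q = 1$ leaves the trust unchanged whatever its novelty, since numerator and
denominator then coincide.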

In the framework of decision theory, expression (1) with one of the specific expressions in (5) or (6) 
provides a parametric form for the utility or decision "function" of the decision maker. It is clear that many 
other forms of the utility function can be used, however, with the constraint of keeping the salient features 
of expression (1) with (5) or (6), in terms of the impact of a new test given past tests, and the quality of the 
comparison between the model predictions and the data. 

Finally, we remark that the proposed form for the multiplier (6) contains an important asymmetry 
between gains and losses: the failure of a single test with strong novelty and significance (as, e.g., for the
localized seismicity on faults in the case of the OFC model and for the leverage effect in the case of the 
MRW model discussed below) cannot be compensated by the success of all the other tests combined. In 
other words, a single test is enough to reject a model. This embodies the common lore that reputation 
gain is a slow process requiring constancy and tenacity, while its loss can occur suddenly with one single 
failure and is difficult to re-establish. We believe that the same applies to the build-up of trust in and, thus, 
validation of a model. 
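
This asymmetry can be made quantitative. The sketch below again assumes the tanh form of (6) inferred from
the limits stated earlier, and borrows the ad hoc grades used in the examples of this paper ($c_{\rm novel} = 100$ and
$p/q = 0.1$ for a failed critical test). It counts how many best-case tests, each capped at $1/\tanh^4(1)$, would
be needed merely to offset one such failure. The helper name `multiplier` is ours.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


gain_cap = multiplier(math.inf, math.inf)  # largest gain a single test can bring
loss = multiplier(0.1, 100.0)              # one failed test of high novelty

# Smallest integer n such that gain_cap**n * loss >= 1.
n_to_offset = math.ceil(-math.log(loss) / math.log(gain_cap))

print(f"gain cap per test : {gain_cap:.2f}")
print(f"loss multiplier   : {loss:.1e}")
print(f"best-case tests needed to offset the loss: {n_to_offset}")
```

Under these assumptions the gain per test is bounded while the loss is not ($F \to 0$ as $p/q \to 0$ for large
$c_{\rm novel}$), so a single failure on a critical test erases the cumulative gain of several maximally successful ones.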




Practical Guidelines for Determining $p/q$ and $c_{\rm novel}$

These two crucial elements of a validation step are conditioned by four basic problems, over which one
can exert at least partial control. In particular, they address the two sources of uncertainty: epistemic
(lack of knowledge, important missing mechanisms) and aleatory [36] (due to variability inherent in the
phenomenon under consideration). In a nutshell, as becomes clearer below, the comparison between $p$ and
$q$ is more concerned with the aleatory uncertainty while $c_{\rm novel}$ deals in part with the epistemic uncertainty.
In the following, as in the two examples (5) and (6), we consider that $p$ and $q$ enter only in the form of their
ratio $p/q$. This need not be the case in general but, given the many uncertainties, this restriction
simplifies the analysis by removing a degree of freedom.

1. How to model? This addresses model construction and involves the structure of the elementary 
contributions, their hierarchical organization, and requires dealing with uncertainties and fuzziness. 
This concerns the epistemic uncertainty. 

2. What to measure? This relates to the nature of $c_{\rm novel}$: ideally, following Palmer et al. [37], one
should target adaptively the observations to "sensitive" parts of the system. Targeting observations 
could be directed by the desire to access the most "relevant" information as well as to get information 
that is the most reliable, i.e., which is contaminated by the smallest errors. This is also the stance 
of Oberkampf and Trucano [38]: "A validation experiment is conducted for the primary purpose 
of determining the validity, or predictive accuracy, of a computational modeling and simulation 
capability. In other words, a validation experiment is designed, executed, and analyzed for the 
purpose of quantitatively determining the ability of a mathematical model and its embodiment in 
a computer code to simulate a well-characterized physical process." In practice, $c_{\rm novel}$ is chosen
to represent the best guess-estimate of the importance of the new observation and the degree of
"surprise" it brings to the validation step [39]. The epistemic uncertainty alluded to above is partially
addressed in the choice of the empirical data and its rating $c_{\rm novel}$ (see the examples of application
discussed below). 

3. How to measure? For given measurements or experiments, the problem is to find the "optimal" 
metric or cost function (involved in the quality-of-fit measure p) for the intended use of the model. 
The notion of optimality needs to be defined. It could capture a compromise between fitting best the 
"important" features of the data (what is "important" may be decided on the basis of previous studies 
and understanding or other processes, or programmatic concerns), and minimizing the extraction of 
spurious information from noise. This requires one to have a precise idea of the statistical properties 
of the noise. If such knowledge is not available, the cost function should be chosen accordingly. The 
choice of the "cost function" involves the choice of how to look at the data. For instance, one may 
want to expand the measurements at multiple scales using wavelet decompositions and compare the 
prediction and observations scale by scale, or in terms of multifractal spectra of the physical fields 
estimated from these wavelet decompositions [40] or from other methods. The general idea here is 
that, given complex observation fields, it is appropriate to unfold the data on a variety of "metrics," 
which can then be used in the comparison between observations and model predictions: the question 
is then how well is the model able to reproduce the salient multiscale and multifractal properties 
derived from the observations? The physics of turbulent fields and of complex systems have offered 
many such new tools with which to unfold complex fields according to different statistics. Each 
of these statistics offers a metric to compare observations with model predictions and is associated 
with a cost function focusing on a particular feature of the process. Because these metrics stem
from the understanding that turbulent fields obey strong constraints in their organization, which
such analyses reveal, they can justifiably be called "physics-based." In practice,
p, and eventually p/q, has to be inferred as an estimate of the degree of matching between the model
output and the observation. This can be done following the concept of fuzzy logic, in which one
replaces the yes/no pass test by a more gradual quantification of matching [41]. We thus concur
with Ref. [42], while our general methodology goes beyond it. Note that this discussion relates to the
aleatory uncertainty [36].

4. How to interpret the results? This question relates to defining the test and the reference probability 
level q that any other model (than the one under scrutiny) can explain the data. The interpretation of 
the results should aim at detecting the "dimensions" that are missing, unspecified or erroneous in the 
model (epistemic uncertainty). What tests can be used to betray the existence of hidden degrees of
freedom and/or dimensions? This is the hardest problem. It can sometimes find an elegant solution 
when a given model is embedded in a more general one. Then, the limitation of the "smaller" model 
becomes clear from the vantage of the more general model. 

We now illustrate our algorithmic approach to model validation using the historical development of quantum
mechanics and four examples based on the authors' research activities. In these examples, we will use
the form (6) and consider three finite values: $c_{\rm novel} = 1$ (marginally useful new test), $c_{\rm novel} = 10$
(substantially new test), and $c_{\rm novel} = 100$ (important new test). When a likelihood test is not available, we
propose to use three possible marks: $p/q = 0.1$ (poor fit), $p/q = 1$ (marginally good fit), and $p/q = 10$ (good fit).
Extreme values ($c_{\rm novel}$ or $p/q$ equal to 0 or $\infty$) have already been discussed. Due to limited experience with
this approach, we propose these ad hoc values in the following examples of its application.
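
For reference, the nine multipliers generated by these ad hoc values can be tabulated. The sketch below is
only illustrative: it assumes the tanh form of (6) inferred from the limits stated earlier, and the names
`multiplier`, `C_NOVEL`, `P_OVER_Q` and `grid` are ours.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


C_NOVEL = (1.0, 10.0, 100.0)   # marginally useful, substantially new, important
P_OVER_Q = (0.1, 1.0, 10.0)    # poor fit, marginally good fit, good fit

# Multiplier for every (novelty, grade) pair used in the examples.
grid = {(c, r): multiplier(r, c) for c in C_NOVEL for r in P_OVER_Q}

print("c_novel " + "".join(f"  p/q={r:<5g}" for r in P_OVER_Q))
for c in C_NOVEL:
    print(f"{c:<8g}" + "".join(f"  {grid[(c, r)]:<9.2g}" for r in P_OVER_Q))
```

The entries agree with the rounded multipliers quoted in the examples (2.9, 2.4, 0.47, 0.0037 and
$4 \times 10^{-4}$) to within rounding.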

Quantum Mechanics 

Quantum mechanics (QM) offers a vivid incarnation of how a model can turn progressively into a theory
held "true" by almost all physicists. Since its birth, QM has been tested again and again because it presents 
a view of "reality" that is shockingly different from the classical view experienced at the macroscopic scale. 
QM prescriptions and predictions often go against classical intuition. Nevertheless, we can state that, by a 
long and thorough process of verified predictions of QM in experiments, fueled by the imaginative set-up of 
paradoxes, QM has been validated as a correct description of nature. It is fair to say that the overwhelming 
majority of physicists have developed a strong trust in the validity of QM. That is, if someone comes up 
with a new test based on a new paradox, for instance, most physicists would bet that QM will come up 
with the right answer with a very high probability. It is thus by the on-going testing and the compatibility 
of the prediction of QM with the observations that QM has been validated. As a consequence, one can use 
it with strong confidence to make predictions in novel directions. This is ideally the situation one would 
like to attain for the problem of validation of models discussed below. We now give a very partial list of 
selected tests that established the trust of physicists in Quantum Mechanics. 

1. Pauli's exclusion principle states that no two identical fermions (particles with half-integer values
of spin) may occupy the same quantum state simultaneously [43]. It is one of the most important
principles in physics, primarily because the three types of particle from which ordinary matter is
made, electrons, protons, and neutrons, are all subject to it. With $c_{\rm novel} = 100$ and perfect agreement
in numerous experiments ($p/q = \infty$), this leads to $F^{(1)} = 2.9$.

2. The EPR paradox [44] was a thought experiment designed to prove that quantum mechanics was 
hopelessly flawed: according to QM, a measurement performed on one part of a quantum system can 
have an instantaneous effect on the result of a measurement performed on another part, regardless of 
the distance separating the two parts. Bell's theorem [45] showed that quantum mechanics predicted 
stronger statistical correlations between entangled particles than the so-called local realistic theory 
with hidden variables. The importance of this prediction requires $c_{\rm novel} = 100$ at the very minimum.
The QM prediction turned out to be correct, winning over the hidden-variables theories [46, 47]
($p/q = \infty$), leading again to $F^{(2)} = 2.9$.

3. The Aharonov-Bohm effect predicts that a magnetic field can influence an electron that, strictly
speaking, is located completely beyond the field's range, again an impossibility according to
non-quantum theories ($c_{\rm novel} = 100$). The Aharonov-Bohm oscillations were observed in ordinary (i.e.,
nonsuperconducting) metallic rings, showing that electrons can maintain quantum mechanical phase
coherence in ordinary materials [48]. This yields $p/q = \infty$ and thus $F^{(3)} = 2.9$ yet again.

4. The Josephson effect provides a macroscopic incarnation of quantum effects in which two
superconductors are predicted to preserve their long-range order across an insulating barrier, for instance,
leading to rapid alternating currents when a steady voltage is applied across the superconductors.
The novelty of this effect again warrants $c_{\rm novel} = 100$ and the numerous verifications and
applications (for instance in SQUIDs: Superconducting QUantum Interference Devices) argue for
$p/q = \infty$ and thus $F^{(4)} = 2.9$, as usual.

5. The prediction of possible collapse of a gas of atoms at low temperature into a single quantum state is
known as Bose-Einstein (BE) condensation, again so much against classical intuition ($c_{\rm novel} = 100$).
The atoms involved are bosons (particles with integer values of spin), which are not subject to the Pauli
exclusion principle evoked in the above test #1 of QM. The first such BE condensate was produced
using a gas of rubidium atoms cooled to $1.7 \times 10^{-7}$ K [49] ($p/q = \infty$), leading once more to
$F^{(5)} = 2.9$.

6. There have been several attempts to develop a paradox-free nonlinear QM theory, in the hope of 
eliminating Schrödinger's cat paradox, among other embarrassments. The nonlinear QM predictions
diverge from those of orthodox quantum physics, albeit subtly. For instance, if a neutron impinges
on two slits, an interference pattern appears, which should, however, disappear if the measurement
is made far enough away ($c_{\rm novel} = 100$). Experimental tests of the neutron prediction rejected the
nonlinear version in favor of the standard QM [50] ($p/q = \infty$), leading to $F^{(6)} = 2.9$.

7. In addition, measurements at the National Bureau of Standards in Boulder, CO, on frequency standards
have been shown to set limits of order $10^{-21}$ on the fraction of the energy of the rf transition
in $^{9}$Be ions that could be due to nonlinear corrections to quantum mechanics [51]. We assign
$c_{\rm novel} = 10$, with $p/q = 10$, to this result, leading to $F^{(7)} = 2.4$. Although less than $F^{(1-6)}$, this is
still meant to be an impressive score.

Combining the multipliers according to (3) leads to $V_{\rm posterior}/V_{\rm prior} \approx 1400$, which is of course only a
lower limit given the many other validation tests not mentioned here.
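
The combination can be reproduced in a few lines. This is a sketch under the assumption that (6) has the
tanh form inferred from its stated limits and that the chain (3) is a plain product of the successive
multipliers; `multiplier` and `qm_tests` are our names. The text quotes multipliers rounded to two digits
($2.9^6 \times 2.4 \approx 1430$), so the unrounded product comes out a few percent higher.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


# (c_novel, p/q) for the seven QM tests above: six critical tests with
# perfect agreement, then the frequency-standard test.
qm_tests = [(100.0, math.inf)] * 6 + [(10.0, 10.0)]

# Chain (3), taken here as the product of the individual multipliers.
trust_ratio = math.prod(multiplier(r, c) for c, r in qm_tests)

print(f"V_posterior / V_prior = {trust_ratio:.0f}")
```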

Tests of QM are ongoing [52]. But given the presumably huge amount of trust physicists have in 
QM which we tried to quantify, why do physicists still feel the need to put QM to the "validation test?" 
This raises the question whether we can ever establish a sense of sufficiency for validation. Our position 
is that this reflects the quixotic quest for the absolute truth, and also the taste for surprises. Perhaps, by 
continuing to test QM, humans will uncover a new insight or an anomaly which may help progress in the 
understanding of reality. 

Four Further Examples Drawn from the Authors' Research Activities 

The Olami-Feder-Christensen (OFC) sand-pile model of earthquakes. This is perhaps the simplest sand- 
pile model of self-organized criticality, which exhibits a phenomenology resembling real seismicity [53]. 
Figure 2 shows a "stress" map generated by the OFC model immediately after a large avalanche (main 
shock) at two magnifications, to illustrate the rich organization of almost synchronized regions [54]. To 
validate the OFC model, we examine the properties and prediction of the model that can be compared with 
real seismicity, together with our assessment of their $c_{\rm novel}$ and quality-of-fit. We are careful to state these
properties in an ordered way, as specified in the above sequences (2)-(3).

1. The statistical physics community recognized the discovery of the OFC model as an important step
in the development of a theory of earthquakes: without a conservation law (which was thought
before to be an essential condition), it nevertheless exhibits a power law distribution of avalanche
sizes resembling the Gutenberg-Richter (GR) law [53]. On the other hand, many other models
with different mechanisms can explain observed power law distributions [55]. We thus attribute
only $c_{\rm novel} = 10$ to this evidence. Because the power law distribution obtained by the model is of
excellent quality for a certain parameter value ($\alpha \approx 0.2$), we formally take $p/q = \infty$ (perfect fit).
Expression (6) then gives $F^{(1)} = 2.4$.

2. Prediction of the OFC model concerning foreshocks and aftershocks, and their exponents for the
inverse and direct Omori laws. These predictions are twofold [56]: (i) the finding of foreshocks and
aftershocks with similar qualitative properties, and (ii) their inverse and direct Omori rates. The first
aspect deserves a large $c_{\rm novel} = 100$ as the observation of foreshocks and aftershocks came as a
rather big surprise in such sand-pile models [57]. The clustering in time and space of the foreshocks
and aftershocks is qualitatively similar to real seismicity [56], which warrants $p/q = 10$, and thus
$F^{(2a)} = 2.9$. The second aspect is secondary compared with the first one ($c_{\rm novel} = 1$). Since the
exponents are only qualitatively reproduced (but with no formal likelihood test available), we take
$p/q = 0.1$. This leads to $F^{(2b)} = 0.47$.

3. Scaling of the number of aftershocks with the main shock size (productivity law) [56]: $c_{\rm novel} = 10$
as this observation is rather new but not completely independent of the Omori law. The fit is good
so we grant a grade $p/q = 10$, leading to $F^{(3)} = 2.4$.

4. Power law increase of the number of foreshocks with the mainshock size [56]: this is not observed in
real seismicity, probably because this property is absent or perhaps due to a lack of quality data. This
test is therefore not very selective ($c_{\rm novel} = 1$) and the large uncertainties suggest a grade $p/q = 1$
(to reflect the different viewpoints on the absence of effect in real data), leading to $F^{(4)} = 1$ (neutral
test).

5. Most aftershocks are found to nucleate at "asperities" located on the mainshock rupture plane or
on the boundary of the avalanche, in agreement with observations [56]: $c_{\rm novel} = 10$ and $p/q = 10$,
leading to $F^{(5)} = 2.4$.

6. Earthquakes cluster on spatially localized geometrical structures, known as faults. This property is
arguably central to the physics of seismicity ($c_{\rm novel} = 100$), but absolutely not reproduced by the
OFC model ($p/q = 0.1$). This leads to $F^{(6)} = 4 \times 10^{-4}$.

Combining the multipliers according to (3) up to test #5 leads to $V_{\rm posterior}/V_{\rm prior} = 18.8$, suggesting
that the OFC model is validated as a useful model of the statistical properties of seismic catalogs, at least
with respect to the properties which have been examined in these first five tests. Adding the crucial last
test, however, strongly fails the model, since then $V_{\rm posterior}/V_{\rm prior} = 0.0075$. The model cannot be used as a
realistic predictor of seismicity. It can nevertheless be useful to illustrate certain statistical properties and to help
formulate new questions and hypotheses.
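
The build-up and subsequent collapse of trust in the OFC model can be followed test by test. The sketch
below assumes the tanh form of (6) inferred from its stated limits and treats the two parts of test #2 as
separate factors; the names `multiplier`, `ofc_tests` and `running` are ours. Because the text quotes
multipliers rounded to two digits, the unrounded cumulative values differ from 18.8 and 0.0075 by several
percent.

```python
import math
import operator
from itertools import accumulate


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


# (label, c_novel, p/q) in the order of the OFC tests listed above.
ofc_tests = [
    ("1", 10.0, math.inf),   # Gutenberg-Richter power law
    ("2a", 100.0, 10.0),     # existence of foreshocks and aftershocks
    ("2b", 1.0, 0.1),        # Omori exponents
    ("3", 10.0, 10.0),       # productivity law
    ("4", 1.0, 1.0),         # foreshock number vs. mainshock size
    ("5", 10.0, 10.0),       # aftershock nucleation at asperities
    ("6", 100.0, 0.1),       # localization on faults
]

factors = [multiplier(r, c) for _, c, r in ofc_tests]
running = list(accumulate(factors, operator.mul))  # cumulative V_posterior / V_prior

for (label, _, _), f, v in zip(ofc_tests, factors, running):
    print(f"test {label:>2}: F = {f:8.4f}   cumulative ratio = {v:8.4f}")
```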

The multifractal random walk (MRW) as a model of financial returns. We now consider the MRW model 
introduced as a random walk with stochastic "volatility" endowed with exact multifractal properties [58], 
which has been proposed as a model of financial time series. Among the documented facts about financial 
time series, we have the absence of correlation between lagged returns, the long-range correlation of lagged 
volatilities, and the observed multifractality. These cannot be taken as validation tests of the model since
they are the observations that motivated the introduction of the MRW. These observations thus constitute
references or benchmarks against which new tests must be compared. The new properties and predictions
of the MRW model that can be compared with real financial return time series are the following.

1. The probability density functions (PDFs) of returns at different time scales (see Figure 3): the
MRW exhibits the remarkable property of accounting quantitatively for the transition from
fatter-than-exponential PDFs at small time scales to approximately Gaussian PDFs at large time scales.
But, because the MRW is intrinsically a model developed as the continuous limit of a cascade
across scales, this is perhaps not very surprising. We thus rate the novelty of this observation with
$c_{\rm novel} = 10$. In the absence of formal likelihood tests on the PDFs, we take $p/q = 10$ to reflect the
apparent excellent fits of the data at multiple scales, leading to $F^{(1)} = 2.4$.

2. Different response functions of the price volatility to large external shocks compared with endogenous
shocks, which are well-confirmed quantitatively by observations on a hierarchy of volatility
shocks [59]. This prediction has been verified to hold with remarkable accuracy without any adjustable
parameters (i.e., the parameters were adjusted previously and fixed before the new test).
We thus rate the novelty of this test with a high $c_{\rm novel} = 100$ and the agreement is quantified by
$p/q = 10$, leading to $F^{(2)} = 2.9$.

3. The sharp-peak/flat-trough pattern of price peaks [60] as well as accelerated speculative bubbles
preceding crashes [61] is not captured by the MRW. In view of the debated importance of such
patterns, we rate these observations with $c_{\rm novel} = 1$ and $p/q = 0.1$, leading to $F^{(3)} = 0.47$.

4. The leverage effect and volatility dependence on past volatility and returns (see [62] and references
therein). These features are not captured by the MRW at all. We rate $c_{\rm novel} = 10$ and the lack of
agreement is quantified by $p/q = 0.1$, leading to $F^{(4)} = 0.0037$.

Combining the multipliers according to (3) leads to $V_{\rm posterior}/V_{\rm prior} = 0.012$, rejecting the model. But
if we stop the validation steps after test #2, where $V_{\rm posterior}/V_{\rm prior} = 7$, we obtain a clear validation signal.
The two additional tests fail the MRW because the observed effects involve mechanisms that are absent in it. Here,
we should conclude that the MRW is a useful model that is validated with respect to certain properties
of the memory of volatility but is not validated for a fully faithful description of the stock market returns.
These mechanisms can actually be incorporated into extensions of the MRW, corresponding to the addition
of new dimensions lacking in the MRW. If we had used the long-range correlation of lagged volatilities
and the observed multifractality (each with parameters $c_{\rm novel} = 10$ and $p/q = 10$) as tests #-1 and #0, the
cumulative multiplier would have gained a factor $2.4^2 = 5.9$, changing $V_{\rm posterior}/V_{\rm prior} = 0.012$ into
$V_{\rm posterior}/V_{\rm prior} = 0.07$, still far from sufficient to validate the model.
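
The three figures quoted in this paragraph follow from the same arithmetic. The sketch below is
illustrative only: it assumes the tanh form of (6) inferred from its stated limits, and the names
`multiplier`, `mrw_tests`, `after_two`, `after_four` and `with_benchmarks` are ours.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


# (c_novel, p/q) for the four MRW tests, in the order listed above.
mrw_tests = [(10.0, 10.0), (100.0, 10.0), (1.0, 0.1), (10.0, 0.1)]
factors = [multiplier(r, c) for c, r in mrw_tests]

after_two = math.prod(factors[:2])   # stopping after test #2
after_four = math.prod(factors)      # all four tests

# Counting the two benchmark observations as tests #-1 and #0,
# each graded c_novel = 10 and p/q = 10 as in the text.
with_benchmarks = after_four * multiplier(10.0, 10.0) ** 2

print(f"after test #2          : {after_two:.2f}")
print(f"after test #4          : {after_four:.4f}")
print(f"with tests #-1 and #0  : {with_benchmarks:.3f}")
```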

An anomalous diffusion model for solar photons in cloudy atmospheres. To properly model climate dy- 
namics, it is important to reduce the significant uncertainty associated with clouds. In particular, estimation 
of the radiation budget in the presence of clouds needs to be improved since current operational models 
for the most part ignore all variability below the scale of the climate model's grid (a few 100 km). So a 
considerable effort has been expended to derive more realistic mean-field radiative transfer models [63], 
mostly by considering only the one-point variability of clouds (that is, irrespective of their actual struc- 
ture). However, it has been widely recognized that the Earth's cloudiness is fractal over a wide range of 
scales [64]. This is the motivation for modeling the paths of solar photons at non-absorbing wavelengths in 
the cloudy atmosphere as convoluted Lévy walks [55], which are characterized by frequent small steps
(inside clouds) and occasional large jumps (typically between clouds) as represented schematically in Fig. 4.
These paths start downward at the top of the highest clouds and end in escape to space or in absorption at 
the surface. In sharp contrast with most other mean-field models for solar radiative transfer, this diffusion 
model with anomalous scaling can be subjected to a battery of observational tests. 

1. The original goal of this phenomenological model, which accounts for the clustering of cloud water 
droplets into broken and/or multi-layered cloudiness, was to predict the increase in steady-state 
flux transmitted to the surface compared to what would filter through that same amount of water
in a single unbroken cloud layer [65]. This property is common to all mean-field photon transport
models that do anything at all about unresolved variability [63]. Thus, we assign only $c_{\rm novel} = 10$
to this test and, given that all models in this class are successful, we have to take $p/q = 1$, hence
$F^{(1)} = 1$. The outcome of this first test is neutral.

2. The first real test for this model occurred when it became possible to accurately estimate the mean 
total path of solar photons that reach the surface. This breakthrough was enabled by access to 
spectroscopy at medium (high) resolution of oxygen bands (lines) [66, 67]. Along with simultaneous 
estimation of cloud optical depth (basically, column-integrated water [kg/m$^2$] times the average
scattering cross-section per kg), the observed trends were explained only by the new model in spite
of the relatively large instrumental error bars. So we assign $c_{\rm novel} = 100$ to this highly discriminating
test and $p/q = 10$ (even though other models were generally not in a position to compete), hence
$F^{(2)} = 2.9$.

3. Another test was proposed using time-dependent photon transport with a source near the surface 
(cloud-to-ground lightning) and a detector in space (the DOE FORTE satellite) [68]. The quantity of 
interest is the observed delay of the light pulse (due to multiple scattering in the cloud system) with 
respect to the radio-frequency pulse (which travels in a straight line). There was no simultaneous 
estimate of cloud optical depth, so assumptions had to be made (informed by the fact that storm 
clouds are at once thick and dense). Because of this lack of an independent measurement, we assign 
only $c_{\rm novel} = 10$ to the observation and $p/q = 1$ to the model performance since this is only about
the finite horizontal extent of the cloud (one could exclude only uniform "plane-parallel" clouds).
So, again we obtain $F^{(3)} = 1$ for an interesting but presently neutral test that needs to be refined.

4. Min et al. [69] developed an oxygen-line spectrometer with sufficient resolution to estimate not just 
the mean path but also its root-mean-square (RMS) value. They found the prediction by Davis and 
Marshak [70] for normal diffusion to be an extreme (envelope) case for the empirical scatter plot of
mean vs. RMS path, and this is indicative that the anomalous diffusion model will cover the bulk
of the data. Because of some overlap with a previous item, we assign $c_{\rm novel} = 10$ and $p/q = 10$
for the model performance (since the anomalous diffusion model had not yet made a prediction for
the RMS path, but other models have yet to make one for the mean path). We therefore obtain
$F^{(4)} = 2.4$.

5. Using similar data but a different normalization than Min et al.'s, more amenable to model testing, 
Scholl et al. [71] observed that the RMS-to-mean ratio for solar photon path is essentially constant 
whether the diffusion is normal or anomalous. This is a remarkable empirical finding to which we 
assign $c_{\rm novel} = 100$. The new mean- and RMS-path data was explained by Scholl et al. by creating
an ad hoc hybrid between the normal diffusion theory (which made a prediction for the RMS path) 
and its anomalous counterpart (which did not). This significant modification of the basic model 
means that we are in principle back to validation step 1 with the new model. However, this exercise 
uncovered something quite telling about the original anomalous diffusion model, namely, that its 
simple asymptotic (large optical depth) form used in all the above tests is not generally valid: for 
typical cloud covers, the pre-asymptotic terms computed explicitly for the normal diffusion case 
prove to be important irrespective of whether the diffusion is normal or not. Consequently, in its 
original form (a simple scaling law for the mean path with respect to cloud thickness and optical 
depth), the anomalous diffusion model fails to reproduce the new data even for the mean path. 
(This means that previous fits yielded "effective" anomaly parameters and were misleading if taken 
literally.) So we assign $p/q = 0.1$ at best for the original model, hence $F^{(5)} = 0.0004$.

Thus, $V_{\rm posterior}/V_{\rm prior} = 0.003$, a fatal blow for the anomalous diffusion model in its simple asymptotic form,
even though, after test #4, $V_{\rm posterior}/V_{\rm prior} = 7.0$, which would have been interpreted as close to a convincing
validation. Of course, this is not the end of the story. The original model has already spawned Scholl et al.'s empirical
hybrid and there is a formalism based on integral (in fact, pseudo-differential) operators that extends
the anomalous diffusion model to pre-asymptotic regimes [72]. More recently, a model for anomalous
transport (i.e., where angular details matter) has been proposed that fits all of the new oxygen spectroscopy
results [73].
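
Since the chain (3) is multiplicative, the history of this model is conveniently read on a logarithmic
scale, where each test adds or subtracts a number of decades of trust. The sketch below assumes the tanh
form of (6) inferred from its stated limits; the names `multiplier`, `diffusion_tests` and `log_gains` are
ours.

```python
import math


def multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed tanh form of multiplier (6), inferred from the limits quoted in the text."""
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4


# (c_novel, p/q) for the five tests of the anomalous diffusion model above.
diffusion_tests = [(10.0, 1.0), (100.0, 10.0), (10.0, 1.0), (10.0, 10.0), (100.0, 0.1)]

# In log10, the product of multipliers becomes a running sum.
log_gains = [math.log10(multiplier(r, c)) for c, r in diffusion_tests]
log_after_four = sum(log_gains[:4])
log_after_five = sum(log_gains)

for k, g in enumerate(log_gains, start=1):
    print(f"test {k}: {g:+.2f} decades")
print(f"after test #4: ratio = {10 ** log_after_four:.2f}")
print(f"after test #5: ratio = {10 ** log_after_five:.4f}")
```

On this scale the two neutral tests contribute exactly zero, and the last test alone removes more than three
decades.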

In summary, the first and simplest incarnation of the anomalous diffusion model for solar photon transport
ran its course and demonstrated the power of oxygen-line spectroscopy as a test for the performance
of solar radiative transfer models required in climate modeling for large-scale average properties. Eventually,
new and interesting tests will become feasible when we obtain dedicated oxygen-line spectroscopy
from space (with NASA's Orbiting Carbon Observatory mission planned for launch in 2007). Indeed, we
already know that the asymptotic scaling for reflected photon paths [74] is different from their transmitted 
counterparts [70] in both mean and RMS. 

A computational fluid dynamics (CFD) model for shock-induced mixing and shock-tube tests. So far, our 
examples of models for complex phenomena have hailed from quantum and statistical physics. In the latter 
case, they are stochastic models composed of: (1) simple code (hence rather trivial verification procedures) 
to generate realizations, and (2) analytical expressions for the ensemble-average properties (that are used 
in the above validation exercises). We now turn to gas dynamics codes which have a broad range of 
applications, from astrophysical and geophysical flow simulation to the design and performance analysis of 
engineering systems. Specifically, we discuss the validation of the "Cuervo" code developed at Los Alamos
National Laboratory [75]. This software, which generates solutions of the compressible Euler equations,
has been verified against a suite of test problems having closed-form solutions; as clearly pointed out by
Oberkampf and Trucano [38], however, this differs from and also does not guarantee validation against
experimental data. A standard test case involves the Richtmyer-Meshkov (RM) instability [76, 77], which 
arises when a density gradient in a fluid is subjected to an impulsive acceleration, e.g., due to passage of a 
shock wave (see Fig. 5). Evolution of the RM instability is nonlinear and hydrodynamically complex and 
hence defines an excellent problem-space to assess CFD code performance. 

In the series of shock-tube experiments described in [78], RM dynamics are realized by preparing one
or more cylinders with approximately identical axisymmetric Gaussian concentration profiles of dense
sulfur hexafluoride (SF$_6$) in air. This (or these) vertical "gas cylinder(s)" is (are) subjected to a weak shock
(Mach number $\approx 1.2$) propagating horizontally. The ensuing dynamics are largely governed by the
mismatch of the density gradient between the gases (with the density of SF$_6$ approximately five times that
of air) and the pressure gradient through the shock wave; this mismatch acts as the source for baroclinic
vorticity generation. The visualization of the density field is obtained using a planar laser-induced
fluorescence (PLIF) technique, which provides high-resolution quantitative concentration measurements. The
velocity field is diagnosed using particle image velocimetry (PIV), based on correlation measurements of
small-scale particles that are lightly seeded in the initial flow field. Careful post-processing of images from
130 $\mu$s to 1000 $\mu$s after shock passage yields planar concentration and velocity with error bars.

1. The RM flow is dominated at early times by a vortex pair (per gas cylinder). Later, secondary
instabilities rapidly transition the flow to a mixed state. We rate $c_{\rm novel} = 10$ for the observations of
these two instabilities. The Cuervo code correctly captures these two instabilities, best observed and
modeled with a single cylinder. At this qualitative level, we rate $p/q = 10$ (good fit), which leads to
$F^{(1)} = 2.4$.

2. Older data for two-cylinder experiments acquired with a fog-based technique (rather than PLIF) 
showed two separated spirals associated with the primary instability, but the Cuervo code predicted 
the existence of a material bridge. This previously unobserved connection was experimentally 
diagnosed with the improved observational technique. Using $c_{\mathrm{novel}} = 10$ and $p/q = 10$ yields
$F^{(2)} = 2.4$.

3. The evolution of the total power as a function of time offers another useful metric. The numerical
simulation quantitatively accounts for the exponential growth of the power with time, within the
experimental error bars. Using $c_{\mathrm{novel}} = 10$ and $p/q = 10$ yields $F^{(3)} = 2.4$.

4. The concentration power spectrum as a function of wavenumber at different times provides another
way (in the Fourier domain) to present the hierarchy of structures already visualized in physical
space ($c_{\mathrm{novel}} = 1$). The Cuervo code correctly accounts for the low-wavenumber part
of the spectrum but underestimates the high-wavenumber part (beyond the deterministic-stochastic
transition wavenumber) by a factor of 2 to 5. We capture this by setting $p/q = 0.1$, which yields
$F^{(4)} = 0.47$.

Combining the multipliers according to Eq. (3) leads to $V_{\mathrm{posterior}}/V_{\mathrm{prior}} = 2.4^3 \times 0.47 \approx 6.5$,
a significant gain, but still not sufficient to compellingly validate the Cuervo code for inviscid
shock-induced hydrodynamic instability simulations. Intricate experiments with three gas cylinders
have been performed [79] and others are currently under way to further stress CFD models.
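
The arithmetic behind the four multipliers and their product can be checked with a short Python sketch. It assumes that the functional prescription referred to as Eq. (6) has the form $F = [\tanh(p/q + 1/c_{\mathrm{novel}}) / \tanh(1 + 1/c_{\mathrm{novel}})]^4$. This form is not restated in this section; it is adopted here because it reproduces the quoted multipliers to within rounding (about 2.44 and 0.475, against the quoted 2.4 and 0.47). The function and variable names are illustrative only.

```python
import math

def trust_multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed form of Eq. (6): [tanh(p/q + 1/c) / tanh(1 + 1/c)]**4."""
    return (math.tanh(p_over_q + 1.0 / c_novel) / math.tanh(1.0 + 1.0 / c_novel)) ** 4

# (c_novel, p/q) for the four Cuervo tests, as rated in the text
cuervo_tests = [(10, 10), (10, 10), (10, 10), (1, 0.1)]

multipliers = [trust_multiplier(pq, c) for c, pq in cuervo_tests]

# Eq. (3): the overall trust gain is the product of the successive multipliers
trust_gain = math.prod(multipliers)

# Same product, but using the two-digit multipliers quoted in the text
quoted_gain = math.prod([2.4, 2.4, 2.4, 0.47])

for k, f in enumerate(multipliers, start=1):
    print(f"F({k}) = {f:.3f}")
print(f"product of unrounded multipliers: {trust_gain:.2f}")
print(f"product of quoted multipliers:    {quoted_gain:.2f}")
```

Under this assumed form, the quoted gain of 6.5 is recovered from the rounded multipliers, while the unrounded multipliers give a slightly larger product. The difference is immaterial at the coarse-grained level at which the ratings are assigned.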

These examples illustrate the utility of representing the validation process as a succession of steps, 
each of them characterized by the two parameters $c_{\mathrm{novel}}$ and $p/q$. The determination of $c_{\mathrm{novel}}$ requires
expert judgment and that of p/q a careful statistical analysis, which is beyond the scope of the present 
report (see Ref. [42] for a detailed case study). The parameter q is ideally imposed as a confidence level, 
say 95% or 99% as in standard statistical tests. In practice, it may depend on the experimental test and 
requires a case-by-case examination. 

The uncertainties of $c_{\mathrm{novel}}$ and of $p/q$ need to be assessed. Indeed, different statistical estimations or
metrics may yield different values of $p/q$, and different experts will likely rate the novelty $c_{\mathrm{novel}}$ of a
new test differently. As a result, the trust gain $V_{\mathrm{posterior}}/V_{\mathrm{prior}}$ after $n$ tests necessarily has a range of
possible values that grows geometrically with $n$. In certain cases, a drastic difference can be obtained by a change of
$c_{\mathrm{novel}}$: for instance, if instead of attributing $c_{\mathrm{novel}} = 100$ to the sixth OFC test, we put $c_{\mathrm{novel}} = 10$ (resp.
1) while keeping $p/q = 0.1$, $F^{(6)}$ is changed from $4 \times 10^{-4}$ to $4 \times 10^{-3}$ (resp. 0.47). The trust gain then
becomes $V_{\mathrm{posterior}}/V_{\mathrm{prior}} = 0.07$ (resp. $\approx 9$). For the sixth OFC test, $c_{\mathrm{novel}} = 1$ is arguably unrealistic,
given the importance of faults in seismology. The two possible choices $c_{\mathrm{novel}} = 100$ and $c_{\mathrm{novel}} = 10$
then give similar conclusions on the invalidation of the OFC model. In our examples, $V_{\mathrm{posterior}}/V_{\mathrm{prior}}$
provides a qualitatively robust measure of the gain in trust after $n$ steps; this robustness has been built in
by imposing a coarse-grained quality on $p/q$ and $c_{\mathrm{novel}}$.
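
The sensitivity of the sixth OFC multiplier to the novelty rating can be checked numerically. The sketch below assumes that Eq. (6) has the form $F = [\tanh(p/q + 1/c_{\mathrm{novel}}) / \tanh(1 + 1/c_{\mathrm{novel}})]^4$, which is not restated in this section but is consistent with the values quoted above. It evaluates $F^{(6)}$ at $p/q = 0.1$ for the three novelty ratings discussed. All names are illustrative.

```python
import math

def trust_multiplier(p_over_q: float, c_novel: float) -> float:
    """Assumed form of Eq. (6): [tanh(p/q + 1/c) / tanh(1 + 1/c)]**4."""
    return (math.tanh(p_over_q + 1.0 / c_novel) / math.tanh(1.0 + 1.0 / c_novel)) ** 4

p_over_q = 0.1  # poor fit, held fixed while the novelty rating is varied
baseline = trust_multiplier(p_over_q, 100)  # original rating of the sixth OFC test

sensitivity = {}
for c_novel in (100, 10, 1):
    f6 = trust_multiplier(p_over_q, c_novel)
    # Eq. (3) is a product, so the overall trust gain is rescaled by f6 / baseline
    sensitivity[c_novel] = (f6, f6 / baseline)
    print(f"c_novel = {c_novel:>3}: F(6) = {f6:.1e}, trust gain rescaled by {f6 / baseline:.0f}x")
```

Because the trust gain is a product of multipliers, changing one rating rescales the final ratio by the ratio of the corresponding multipliers. Under the assumed form, this is roughly one order of magnitude for $c_{\mathrm{novel}} = 10$ and three orders of magnitude for $c_{\mathrm{novel}} = 1$, in line with the change from about 0.007 to 0.07 and to about 9 reported above.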

Summary 

The validation of numerical simulations continues to become more important as computational power 
grows and the complexity of modeled systems increases, and as increasingly important decisions are in- 
fluenced by computational models. We have proposed an iterative, constructive approach to validation 
using quantitative measures and expert knowledge to assess the relative state of validation of a model in- 
stantiated in a computer code. In this approach, the increase/decrease in validation is mediated through a 
function that incorporates the results of the model vis-a-vis the experiment together with a measure of the 
impact of that experiment on the validation process. While this function is not uniquely specified, it is not 
arbitrary: certain asymptotic trends, consistent with heuristically plausible behavior, must be observed. In 
five fundamentally different examples, we have illustrated how this approach might apply to a validation 
process for physics or engineering models. We believe that the multiplicative decomposition of trust gains 
or losses (given in Eq. 3), using a suitable functional prescription (such as Eq. 6), provides a reasoned 
and principled description of the key elements — and fundamental limitations — of validation. It should be 
equally applicable to biological and social sciences, especially since it is built upon the decision-making 
processes of the latter. We believe that our procedure transforms the paralyzing criticisms in Popper's style
that "we cannot validate, we can only invalidate" [14], espoused for instance by Konikow and Bredehoeft
in the context of groundwater models [80], into a practical constructive algorithm that specifically
addresses two problems: distinguishing between competing models, and transforming the vicious circle of
lack of suitable data into a virtuous circle that quantifies the evolving trust in a model based on the novelty
and relevance of new data and the quality of fits.

Figure 2: (color online) Map of the "stress" field generated by the OFC model immediately after a large
avalanche (main shock) at two magnifications. The upper panel shows the whole grid of size L = 1024 
and the lower plot represents a subset of the grid delineated by the square in the upper plot. Adapted from 
[56]. 

Figure 3: Continuous deformation of the PDF of increments across scales. (a) MRW model: standardized
PDFs (in logarithmic scale) of the MRW increments for 5 different time scales (from top to bottom),
A = 16, 128, 2048, 8192, 32768. One can observe the continuous deformation and the appearance of fat
tails when going from large to fine scales. (b) S&P500 futures: standardized PDFs of the returns at scales
(from top to bottom) A = 10, 40, 160 min, 1 day, 1 week and 1 month. As in panel (a), the scale is
logarithmic and plots have been arbitrarily shifted along the vertical axis for clarity. Adapted from [58].

Figure 4: Schematic representation of the anomalous diffusion model of solar photon transport at
non-absorbing wavelengths in the cloudy atmosphere. In this model, solar beams follow convoluted Lévy
walks, which are characterized by frequent small steps (inside clouds) and occasional large jumps (between
clouds or between clouds and the surface). The partition between small and large jumps is controlled by the
Lévy index $\alpha$ (the PDF of the jump sizes $\ell$ has a tail decaying as a power law $\sim 1/\ell^{1+\alpha}$ with $1 < \alpha < 2$).

Figure 5: Schematic of the interactions between weakly shocked (Mach number M = 1.2) light gas (air)
and a column of dense gas (SF$_6$). The Richtmyer-Meshkov instability arises from the mismatch between
the pressure gradient (at the shock front) and the density gradient (between the light and dense gases),
which acts as a source of baroclinic vorticity. The column of dense gas "rolls up" into a double-spiral form
under the action of the evolving vorticity.